We propose a stochastic model for the number of different words in a given database which
incorporates the dependence on the database size and historical changes. The main feature of our
model is the existence of two different classes of words: (i) a finite number of core-words, which have
higher frequency and do not affect the probability of a new word being used; and (ii) the remaining,
virtually infinite number of noncore-words, which have lower frequency and, once used, reduce the
probability of a new word being used in the future. Our model relies on a careful analysis of
the google-ngram database of books published in the last centuries, and its main consequence is the
generalization of Zipf's and Heaps' laws to two scaling regimes. We confirm that these generalizations
yield the best simple description of the data among generic descriptive models and that the two
free parameters depend only on the language but not on the database. From the point of view
of our model, the main change on historical time scales is the composition of the specific words
included in the finite list of core-words, which we observe to decay exponentially in time with a rate
of approximately 30 words per year.

I. INTRODUCTION

The finitude of the vocabulary size is an intuitive idea.
It is supported by the fact that words cannot be
arbitrarily long and are built from the combination of a
finite number of phonemes or letters. On the other hand,
even in our time of big data [1-3] there is no indication
of a saturation of the vocabulary size (total number of
different words) with increasing database size. In order
to clarify whether the finitude of the vocabulary has any
practical consequence, it is essential to understand not
only the birth and death of words [4-6], but also how the
number of different words depends on the database size.
The interest in this problem is motivated by linguistic
studies [7, 8] as well as by applications in search engines,
which require an estimate of the number of different
words expected in a given database [9-11].

The scaling between the number of different words, $N$,
and the size of the database in words, $M$, as
$N \sim M^{\lambda}$ is known as Heaps' law [12] and has been
studied in different linguistic [13-15] and non-linguistic
[16, 17] contexts. The universality and interest of this
empirical scaling is surpassed only by Zipf's law [18],
which states that the frequency $F(r)$ of the $r$-th most
frequent word in a database decays as $F(r) \sim 1/r$. The
relation between Heaps' and Zipf's laws has been the
subject of great recent interest [19-21]. Furthermore, it
is well known that deviations from Heaps' and Zipf's laws
are observed in the tails of the Heaps' and Zipf's plots
(i.e., for large $N$ and $r$, respectively) [22-24]. Similar
deviations in fat-tailed distributions appear in a variety
of social and physical systems [25, 26] and are crucial
when extrapolating to the limit of large databases.

In this paper we propose a stochastic growth model
whose predictions go beyond the simpler scalings of
Heaps' and Zipf's laws and are compatible with actual
observations in the tails of the corresponding distributions.
Our model is in the same spirit as, but differs from, the
simpler versions of Yule's, Simon's, Gibrat's, and
preferential-attachment growth models [25, 27], because it
contains two categories of words and leads to two scaling
regimes in the Heaps' and Zipf's plots. Our statistical
analysis of the google-ngram database indicates that the
only two free parameters of our model remain unchanged
over centuries and depend only on the language, and that
there is a slow change of the words belonging to each
category. The latter adds to the recent interest in language
dynamics as a complex system [28, 29].

The paper is organized as follows: in Sec. II we present
a statistical analysis of the google-ngram database in terms
of word frequencies as well as the growth of the vocabulary.
This then leads us to the formulation of our stochastic
model for the vocabulary growth in Sec. III. In Sec. IV we
investigate dynamical aspects on historical time scales
within the framework of our model.



II. DATA ANALYSIS 
Data 

The main motivation for our model comes from empirical
observations. As databases, we use the google-ngram
corpus [1] for English, German, French, Spanish, and
Russian, which provides word frequencies (occurring in
printed books) with a yearly resolution for a period of
several hundred years (1520-2000). Our main interest in
this database stems from its large size (several million
books with $> 10^{11}$ words) and from the long time span it
covers (thus enabling us to trace historical changes in the
usage of language). We consider as words only the 1-grams
consisting exclusively of letters present in the alphabet
of the corresponding language. This pragmatic definition
guarantees that our observations are not affected by
symbolic sequences, foreign words, numbers, or scanning
problems. For each language we use two different
partitions of the database: i) yearly ($y$), in which case
$y(t)$ corresponds to the database of the year $t$; and
ii) cumulative ($Y$), in which case
$Y(t) = \sum_{t'=t_0}^{t} y(t')$. We consider only words
which appeared at least $n = 41$ times in order to avoid
biases due to the filtering mechanism used in the
google-ngram database, see Supplementary Information (SI)
Sec. I for further details. Here we show our detailed
analysis for the largest database (English, $N = 335$,
$t_0 = 1520$, consistent data for $t \in [1805, 2000]$).
For the other 4 languages we report the main findings
and leave the details for the SI.

FIG. 1. Rank-frequency distribution shows double scaling behavior (Zipf's plot). a) Rank-frequency distribution for the English
database $y(2000)$ (solid) and a ML-fit of Eq. (1) (dashed). b+c) Parameters $\gamma$ and $b$ obtained from ML-fits of Eq. (1) to yearly
$y(t)$ (x-symbols) and accumulated $Y(t)$ (solid) databases for $t \in [1805, 2000]$. Arrows indicate the values of the parameters $\gamma^*$
and $b^*$ obtained for the fit in a).
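
The word definition and the two partitions described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the processing pipeline actually used: the input format (a dictionary of raw 1-gram counts per year), the function names, and the choice to apply the threshold $n$ to the accumulated counts in $Y(t)$ are all hypothetical. The decapitalization and the apostrophe follow the description in SI Sec. I.

```python
from collections import Counter

# Assumption: English alphabet plus apostrophe, as described in SI Sec. I.
ENGLISH_ALPHABET = set("abcdefghijklmnopqrstuvwxyz'")

def clean_counts(raw_counts, alphabet=ENGLISH_ALPHABET):
    """Lower-case every 1-gram and keep only those made of alphabet characters."""
    cleaned = Counter()
    for token, count in raw_counts.items():
        word = token.lower()
        if word and all(ch in alphabet for ch in word):
            cleaned[word] += count
    return cleaned

def apply_threshold(counts, n_min=41):
    """Keep only words that occur at least n_min times."""
    return Counter({w: c for w, c in counts.items() if c >= n_min})

def yearly_database(raw_by_year, t, n_min=41):
    """y(t): filtered word counts of the books published in year t."""
    return apply_threshold(clean_counts(raw_by_year[t]), n_min)

def cumulative_database(raw_by_year, t, n_min=41):
    """Y(t): counts accumulated over all years up to t.

    Assumption: the threshold is applied to the accumulated counts.
    """
    total = Counter()
    for year, raw_counts in raw_by_year.items():
        if year <= t:
            total.update(clean_counts(raw_counts))
    return apply_threshold(total, n_min)

def types_and_tokens(database):
    """Return (N, M): the number of word-types and of word-tokens."""
    return len(database), sum(database.values())
```

Each database then contributes one point $(M, N)$ to the Heaps' plot via `types_and_tokens`, and its sorted counts give the rank-frequency distribution.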



Zipf's analysis 

Our first empirical analysis focuses on the distribution
of word frequencies. In his seminal work, Zipf proposed
that the frequency of the $r$-th most frequent word in
a given text is given by $F(r) = F(1)/r$ [18]. It is
easy to see that this scaling has to break down for large
$r$: due to the divergence of the harmonic series, for
sufficiently large databases one arrives at
$\sum_{r=1}^{N} F(r) > 1$ (sum of frequencies larger than
the text size). In English, $F(1) \approx 0.07$ (the
frequency of "the") and $\sum_{r=1}^{N} F(r) > 1$ for
$N \approx 10^6$, meaning that $F(r)$ has to decay faster
than $1/r$ before such ranks are reached. This well-known
expectation, which is clearly seen in our data shown in
Fig. 1a), motivated numerous generalizations of Zipf's
proposal [30-32]. While many of these proposals were shown
to provide a better account of particular databases, they
remain to a great extent unsatisfactory because they lack
the simplicity and universality of Zipf's original proposal
(e.g., the parameters vary depending on the size, topic,
or date of publication of the analyzed texts [33, 34]).
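
The arithmetic behind this argument is easy to check numerically. The following sketch (the function name is ours) accumulates $F(1)/r$ until the sum of frequencies exceeds 1, assuming a pure Zipf law with $F(1) = 0.07$.

```python
def rank_where_zipf_sum_exceeds_one(f1):
    """Smallest N such that sum_{r=1}^{N} f1 / r > 1 for a pure Zipf law."""
    total = 0.0
    rank = 0
    while total <= 1.0:
        rank += 1
        total += f1 / rank
    return rank

# With F(1) = 0.07 the sum of frequencies passes 1 slightly below N = 10^6.
n_critical = rank_where_zipf_sum_exceeds_one(0.07)
```

The same estimate follows from $H_N \approx \ln N + 0.577$: the condition $0.07\, H_N > 1$ requires $\ln N > 13.7$, i.e., $N$ of the order of $10^6$.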



Motivated by the large size of our database, we apply
rigorous statistical tests to determine which of the
previously proposed distributions provides a better
account of the data. We select 7 of the most popular
previously proposed heavy-tailed distributions with
at most 2 free parameters [8, 23, 24]: power-law, two
power-laws, shifted power-law, log-normal, Weibull, and
power-laws with exponential cutoffs (in the tail and in
the beginning, respectively). The parameters of each
distribution were obtained numerically by means of Maximum
Likelihood (ML) estimation [35]. In addition, for each fit
we i) calculate the probability that the data was generated
by that model ($\chi^2$ p-value [36]) and ii) compare which
model is more likely to describe the data (relative
likelihood [37]); for details see SI Sec. II A.

The results show that it is extremely unlikely (P < 
10~^^) that the data was drawn exactly from any of 
the proposed distributions, a consequence of the large 
databases which makes any small (true) deviation incom- 
patible with these simple fits. On the other hand, the re- 
sults show unequivocally that for English the distribution 
with two power-laws is clearly the best fit (1 — P < 10~^^) 
for all databases with a size of > 10^ tokens. 

We now discuss in detail the best two-parameter model
we identify from our data:

$$F(r; \gamma, b) = C \begin{cases} r^{-1}, & r \le b, \\ b^{\gamma - 1}\, r^{-\gamma}, & r > b, \end{cases} \qquad (1)$$

where $b$ and $\gamma$ are free parameters and
$C = C(\gamma, b)$ is the normalization constant. The effect
of the threshold $n$ applied to the frequency of words is
that, in practice, data of $F(r)$ is limited to
$F(r) \ge n/M$ ($M$ is the observed number of tokens). The
original Zipf's law is recovered for high-frequency words,
and a critical rank $r = b$ determines a transition to a
power-law with exponent $\gamma$. Double power-laws were
proposed as a generalization of Zipf's law in Ref. [38]
and further investigated in Refs. [39, 40]. These
insightful works used distributions with two power-law
exponents $\gamma_1, \gamma_2$ and were motivated by the
visual inspection of double logarithmic plots. Our improved
statistical analysis confirms and extends these
observations for the simpler distribution Eq. (1). Besides
the likelihood analysis and the visual inspection given in
Fig. 1, a third strong piece of evidence in favor of
distribution (1) comes from the comparison of the estimated
parameters of different corpora shown in Fig. 1b,c). Very
similar values $b \in [7 \cdot 10^3, 12 \cdot 10^3]$ and
$\gamma \in [1.8, 2.5]$ were obtained for non-overlapping
databases, and the fluctuations become smaller for
increasing database size. These observations strongly
suggest that the same fixed parameters provide a good
description of all English texts (e.g., $y(1900)$ and
$y(2000)$). In order to test this, hereafter we do not
consider individual fits for each database and instead
assume that Eq. (1) is valid with $b = b^* = 7873$ and
$\gamma = \gamma^* = 1.77$, values obtained for our largest
database $y(2000)$.
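
To make the two-parameter model concrete, the sketch below evaluates a double power-law (Zipf's $r^{-1}$ up to rank $b$, $r^{-\gamma}$ beyond, continuous at $r = b$) on a finite range of ranks and selects $(b, \gamma)$ by maximizing a multinomial log-likelihood over a grid. It is a simplified stand-in for the ML procedure of SI Sec. II A, not a reproduction of it: the normalization over the observed ranks only, the grid search, and all function names are assumptions made for illustration.

```python
import math

def double_power_law(n_ranks, b, gamma):
    """Normalized F(r) for r = 1..n_ranks: r^-1 up to rank b, r^-gamma beyond."""
    weights = [1.0 / r if r <= b else b ** (gamma - 1.0) * r ** (-gamma)
               for r in range(1, n_ranks + 1)]
    norm = sum(weights)  # plays the role of 1/C on a finite range of ranks
    return [w / norm for w in weights]

def log_likelihood(counts, b, gamma):
    """Multinomial log-likelihood of rank-ordered counts (most frequent first)."""
    probs = double_power_law(len(counts), b, gamma)
    return sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)

def fit_grid(counts, b_grid, gamma_grid):
    """Return the (b, gamma) pair on the grid with the highest log-likelihood."""
    best = max((log_likelihood(counts, b, g), b, g)
               for b in b_grid for g in gamma_grid)
    return best[1], best[2]
```

For example, synthetic counts generated as `[1e6 * p for p in double_power_law(2000, 50, 1.8)]` are assigned their generating parameters by `fit_grid` whenever these lie on the grid.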

Similar findings also apply to the other languages. In
Tab. I we summarize the parameters $\gamma^*$ and $b^*$
obtained from a ML-fit of Eq. (1) to the largest database
$Y(2000)$ of the respective language. French and Spanish
are also best described by Eq. (1) for databases exceeding
a particular size and yield values for $\gamma^*$ and $b^*$
similar to English. For German and Russian, Eq. (1)
constitutes only the second best model. However, we have
strong indications that it provides a better account of the
tails ($r \gg b^*$), and therefore we expect that even
larger databases will reveal the double power-law as the
best fit also in these languages (see SI Sec. II B for
details). Apart from these being the smallest databases
among the investigated languages, another feature affecting
the fitting in German and especially in Russian is the
higher degree of inflection in the morphology of these
languages. We recall that no lemmatization was applied in
our definition of words and, therefore, inflected words
(obtained, e.g., by adding a suffix) are counted as new
word-types. This reasoning explains the higher measured
values of $b^*$ (the vocabulary in the $r^{-1}$ regime).
From the fitting perspective, however, the large values of
$b^*$ in German and Russian require even larger databases
to characterize the deviations from the $r^{-1}$ regime for
$r \gg b^*$.



language    $b^*$     $\gamma^*$
English      7,873    1.77
French       8,208    1.78
Spanish      8,757    1.78
German      19,863    1.62
Russian     62,238    1.94

TABLE I. Parameters $b^*$ and $\gamma^*$ obtained from ML-fits of
Eq. (1) to the largest database $y(2000)$ for all
considered languages.



Heaps' analysis

We now turn to our second empirical analysis: the
dependence of the total number of different words
(word-types, $N$) on the size of the database (in
word-tokens, $M$). The classical result for this relation
is the empirical Heaps' law [12], which states that
$N \sim M^{\lambda}$ with $\lambda \in [0, 1]$ ($a \sim b$
indicates that $a/b$ = constant for large $b$). We start by
searching for the consequences of our previous observations
in the Zipf's analysis for this new problem. A simple and
powerful approach is the so-called Zipfian ensemble (ZE)
[21], which assumes that the occurrence of every possible
word is governed by a Poisson process with an intensity
proportional to its frequency (see SI Sec. III A). It was
shown that under this or similar assumptions (e.g.,
stochastic processes with fixed frequencies for words),
asymptotically Heaps' law can be interpreted as a direct
consequence of a Zipfian rank-frequency distribution
$F(r) \sim r^{-\gamma}$ [8, 13-15, 21] and vice versa
[20, 41, 42], where $\gamma = 1/\lambda$. Here we want
to draw attention to the fact that these observations are
not restricted to Zipf's and Heaps' laws, i.e., assuming
a stochastic model, the relationship between $F(r)$ and
$N(M)$ can always be established. The expectation of the
ZE of Eq. (1) with a threshold $n > 1$ is (see SI Sec. III
B)

$$N(M) \simeq C_n M \quad (M \ll M_b), \qquad N(M) \sim M^{1/\gamma} \quad (M \gg M_b), \qquad (2)$$

where $M_b$ is the number of tokens such that $N(M_b) = b$
and the scaling constant is $C_n = C/n$ ($C = F(1)$ being
the frequency of the most common word, as can be seen
from Eq. (1)). Thus, the effect of the threshold $n$ applied
to the growth curve of the vocabulary simply amounts
to rescaling the constant $C$. While the expected (average)
number of word-types over many realizations of the
stochastic process leads to a sharp transition between the
two regimes, the values of $N(M \approx M_b)$ might depend
more strongly on the particular realization.
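
The expectation of the Zipfian ensemble can also be evaluated numerically. In the sketch below each word of rank $r$ is assumed to occur a Poisson-distributed number of times with mean $M F(r)$, and the expected vocabulary is the sum over ranks of the probability of reaching the threshold $n$. The double power-law form of $F(r)$ with an externally supplied constant $C$, the truncation at `max_rank`, and the function names are assumptions of this illustration; the analytical treatment is given in SI Sec. III.

```python
import math

def double_power_law(r, b, gamma, C):
    """Unnormalized-by-construction F(r): C/r up to rank b, C b^(gamma-1) r^-gamma beyond."""
    return C / r if r <= b else C * b ** (gamma - 1.0) * r ** (-gamma)

def prob_at_least(n, mu):
    """P(X >= n) for X ~ Poisson(mu), summing the first n terms of the pmf.

    exp(-mu) underflows to 0 for very large mu, which correctly gives 1 here
    as long as n is small compared to mu (adequate for thresholds such as 41).
    """
    term = math.exp(-mu)  # P(X = 0)
    cdf = term
    for k in range(1, n):
        term *= mu / k    # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def expected_vocabulary(M, b, gamma, C, n=1, max_rank=100_000):
    """Expected number of word-types seen at least n times among M tokens."""
    return sum(prob_at_least(n, M * double_power_law(r, b, gamma, C))
               for r in range(1, max_rank + 1))
```

Evaluating `expected_vocabulary` at two sizes well beyond the transition, e.g. $M = 10^5$ and $M = 10^6$ with $b = 100$, $\gamma = 2$, $C = 0.1$, gives a local slope in the log-log plot close to $1/\gamma$, as expected for the second regime. `max_rank` must be large enough that ranks beyond it contribute negligibly.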

In Fig. 2 we show that the data in the google-ngram
database obey the scalings of Eq. (2). In Fig. 2a) we
present the $N(M)$ curve for English. While for the yearly
database $y(t)$ we obtain a set of points, one for each
$t$, the cumulative database $Y(t)$ builds a curve of
vocabulary growth for increasing $t$. Despite the
differences between these databases, all the data lie in a
relatively narrow region of the plot which resembles a
single curve compatible with the double scaling of Eq. (2).
This curve is well described by the $N(M)$ curve obtained
from the combination of the double power-law distribution
Eq. (1) with fixed parameters $(\gamma^*, b^*)$ and the
assumption of Poisson usage of words, in the spirit of the
ZE. Similar observations apply to all considered languages,
as shown in Fig. 2b). On closer inspection, Fig. 2c), the
fine details of the $N(M)$ curve are not compatible with
the fluctuations expected from the strongly simplifying
assumptions of the ZE. It is, nevertheless, remarkable that
the agreement between model and data remains within 50% for
different databases and over 9 orders of magnitude in
size.

FIG. 2. Vocabulary $N$ as a function of database size $M$ (Heaps' plot). a) Number of word-types as a function of word-tokens
for the yearly $y(t)$ (x-symbols) database, the cumulative $Y(t)$ (solid) database, and the Zipfian ensemble (dashed) assuming $n = 41$
and the rank-frequency distribution Eq. (1) with $b^* = 7873$ and $\gamma^* = 1.77$. b) Same curves as in a) but for different languages,
showing the same scaling behaviour. In order to increase visibility, the curves for French, Spanish, and Russian were shifted,
respectively, by one, two, and three decades with respect to their x-values. c) Difference of the curves in a): deviation of
the data $y(t)$ and $Y(t)$ ($N_{\mathrm{data}}$) from the ZE growth curve ($N_{\mathrm{ZE}}$). The dashed lines show the 95%-confidence interval of the
ZE. d) Deviation of a ZE growth curve with a hypothetically finite vocabulary ($N_{\mathrm{finiteZE}}$) from the ZE growth curve with
infinite vocabulary ($N_{\mathrm{ZE}}$) assuming the rank-frequency distribution Eq. (1). The possible size of the total vocabulary is given in units $k$
of the number of observed types in $y(2000)$, such that $N^{\max} = k \cdot 4{,}263{,}717$ with $k = 1, 2, 5, 10, 100$. Since
$N_{\mathrm{finiteZE}}(M) \to N^{\max}$ for $M \to \infty$, the deviation for $k = 1$ already becomes large for sufficiently large $M$.

Here it is worth revisiting the question about the
finitude of the vocabulary. Even after more than $10^6$
different words, the $N(M)$ data in Fig. 2 do not seem to
saturate. To further investigate this point, we perform
the ZE with the same rank-frequency distribution from
Eq. (1) (fixed $b^*$, $\gamma^*$) but varying the maximum
possible number of different words $N^{\max}$, e.g., 1, 2,
5, 10, and 100 times the observed number of distinct words
in our largest database $y(2000)$. It can be seen in
Fig. 2d) that the differences between the predicted growth
curves for such different hypothetical vocabulary sizes are
negligible compared to the fluctuations of the real data.
From this we conclude that, given the data accessible so
far, the possible vocabulary can be regarded for all
practical purposes as infinite (although it is bounded by
combinatorial arguments due to a finite alphabet and word
length). The fact that the same distribution Eq. (1) with
fixed parameters accounts for the observations across all
years shows that the observation of different numbers of
words is driven mainly by the different database sizes and
not by a change in vocabulary richness over time.



III. MODEL 

In this section we propose a simple generative model
which recovers, and allows for an improved interpretation
of, the double scalings in our empirical findings,
Eqs. (1) and (2). In line with the tradition of stochastic
growth models explaining fat-tailed distributions, we
consider an extension of Yule's model [25] which contains
two classes of word-types: a core vocabulary and a noncore
vocabulary [39]. At each step a word-token is drawn
($M \to M + 1$) and attributed to a word-type depending
on probabilities specified below, see Fig. 3 for a sketch
of the model. The total number of different word-types
is given by $N = N_c + N_{\bar c}$, where $N_c$
($N_{\bar c}$) is the number of core-words
(noncore-words). The new word-token can either be
a new word-type ($N \to N + 1$) with probability
$p_{\mathrm{new}}$ or an already existing type ($N \to N$)
with probability $1 - p_{\mathrm{new}}$. In the latter case,
a (previously used) type is attributed to the token at
random with probability proportional to the number of times
this type has occurred before. In the former case, the new
type can either originate from a finite set of
$N_c^{\max}$ core-words ($N_c \to N_c + 1$) with
probability $p_c$ or come from a potentially infinite set
of noncore-words ($N_{\bar c} \to N_{\bar c} + 1$) with
probability $1 - p_c$. In our simplest model we consider
$p_c$ to be a constant which becomes zero only if all
core-words were drawn ($N_c = N_c^{\max}$):

$$p_c = \begin{cases} p_c^0, & N_c < N_c^{\max}, \\ 0, & N_c = N_c^{\max}. \end{cases} \qquad (3)$$



The final element of our model, which establishes the
distinguishing aspect of core-words, is the dependence
of $p_{\mathrm{new}}$ on $N$. By definition, we think of
core-words as belonging to the central vocabulary of the
language and, therefore, the usage of a new core-word does
not affect the probability of using a new word-type in the
future, i.e., $p_{\mathrm{new}} = p_{\mathrm{new}}(N_{\bar c})$.
On the other hand, if a noncore-word is drawn
($N_{\bar c} \to N_{\bar c} + 1$), $p_{\mathrm{new}}$ should
decrease for future choices. We saw that there is no
indication of an upper bound on the vocabulary
($p_{\mathrm{new}} > 0$) over many decades, which suggests
that the reduction of $p_{\mathrm{new}}$ with $N_{\bar c}$
is slow. Accordingly, we assume

$$p_{\mathrm{new}} \mapsto p_{\mathrm{new}} \left( 1 - \frac{\alpha}{N_{\bar c} + s} \right), \qquad (4)$$

with the decay rate $\alpha > 0$ and the constant
$s \gg 1$, which is introduced simply in order to damp the
reduction of $p_{\mathrm{new}}$ for small $N_{\bar c}$ (for
simplicity, we use $s = N_c^{\max}$).

We now show how this model recovers Eqs. (1) and (2).
We require that $1 - p_c^0 \ll 1$, which simply means that
initially it is much more likely to draw core-words than
noncore-words. In this case we can obtain approximate
solutions for $N(M)$ in the two limiting cases considered
in Eq. (2). When $N_c < N_c^{\max}$, $1 - p_c \ll 1$ so
that $p_{\mathrm{new}} \approx$ const., and therefore we
trivially obtain $N \sim M^{\lambda}$ with $\lambda = 1$.
This case corresponds to the very beginning of the
vocabulary growth, when most new word-types belong to
the set of core-words. In the case $N_c = N_c^{\max}$,
$p_c = 0$ and $N \approx N_{\bar c}$, so that Eq. (4)
becomes in the continuum limit:

$$\frac{d\, p_{\mathrm{new}}(N)}{dN} = -\alpha\, \frac{p_{\mathrm{new}}(N)}{N}, \qquad (5)$$

from which it follows that $p_{\mathrm{new}} \sim N^{-\alpha}$.
Thus we see that Eq. (4) is a minimal assumption that
guarantees that the vocabulary can in principle be of
infinite size.

FIG. 3. Illustration of our generative model for the usage of
new words. [Flowchart: for the $M$-th word-token, is it a new
word-type? No (probability $1 - p_{\mathrm{new}}(N_{\bar c})$):
choose a previous word, proportional to its frequency. Yes
(probability $p_{\mathrm{new}}(N_{\bar c})$): is it a core-word?
Yes: $N_c \to N_c + 1$. No: $N_{\bar c} \to N_{\bar c} + 1$.]

We now obtain the expected growth curve $N(M)$. Notice
that our model can be considered a biased random walk in
$N$, which, as an approximation, can be mapped onto a
binomial random walk by the coordinate transformation
$N(M)$ such that $p_{\mathrm{new}}(N) = p_{\mathrm{new}}(N(M))$.
The resulting Poisson-Binomial process [43] can be treated
analytically, e.g., the transformation $N(M)$ is then given
by the average of the vocabulary growth:

$$N(M) = \int_0^M dM'\, p_{\mathrm{new}}(M') = \int_0^{N(M)} dN'\, p_{\mathrm{new}}(N')\, \frac{dM'}{dN'}. \qquad (6)$$

Using $p_{\mathrm{new}} \sim N^{-\alpha}$, this equation
holds (self-consistently) only by assuming a sub-linear
growth of the vocabulary, $N \sim M^{\lambda}$, where the
relation $\lambda = (1 + \alpha)^{-1}$ is established. In
accordance with Eq. (2), we identify the following
relations between the parameters: $N_c^{\max} = b$ and
$\alpha = \gamma - 1$. The fitting parameters of Eq. (1)
can thus be interpreted as follows: $b$ is the size of the
core vocabulary and $\gamma$ controls the sensitivity of
the probability of using a new word to the number of
already used words in Eq. (5).
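
The self-consistency argument can be spelled out in a few lines. The following is a sketch that assumes pure power laws for both $p_{\mathrm{new}}(N)$ and $N(M)$ in the second regime and ignores prefactors:

```latex
\begin{align}
  p_{\mathrm{new}}(N) \sim N^{-\alpha}, \quad N(M) \sim M^{\lambda}
    &\;\Rightarrow\; \frac{dN}{dM} \sim M^{\lambda - 1}, \\
  \frac{dN}{dM} = p_{\mathrm{new}}\bigl(N(M)\bigr) \sim M^{-\alpha\lambda}
    &\;\Rightarrow\; \lambda - 1 = -\alpha\lambda
     \;\Rightarrow\; \lambda = \frac{1}{1 + \alpha}, \\
  \alpha = \gamma - 1
    &\;\Rightarrow\; \lambda = \frac{1}{\gamma}.
\end{align}
```

For the English value $\gamma^* = 1.77$ this gives $\alpha = 0.77$ and $\lambda \approx 0.56$ for the vocabulary growth beyond the core vocabulary.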

Since the probability of usage of an already used
word-type is assumed to be proportional to the number of
times it has occurred before, we guarantee that Eq. (2)
implies Eq. (1) [20], meaning that the double scaling in
the Zipf plot is also recovered from our generative model.
While the previous arguments show that the correct scalings
are obtained by our model, in order to obtain agreement
with the data it is essential to: (i) use the normalization
constant $C$ in order to determine the initial probability
of finding a new word in Eq. (4); (ii) re-scale the
distribution using the threshold $n$ as $M/n$; and (iii)
account for the disproportionately large weight of the
first word-types (in the Zipf plot). Taking these points
into account, direct simulations of the model in Fig. 3
with the previously fitted parameters $b = b^*$ and
$\gamma = \gamma^*$ lead to Zipf's and Heaps' curves which
resemble the original fits. See SI Sec. IV for all details.
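
A token-by-token simulation of the process sketched in Fig. 3 could look as follows. This is a minimal illustration, not the simulation of SI Sec. IV: it omits the corrections (i)-(iii) listed above, and it assumes that $p_{\mathrm{new}}$ is multiplied by $1 - \alpha/(N_{\bar c} + s)$ with $s = N_c^{\max}$ each time a noncore-word is added. The function name, the argument names, and the initial value `p0` are hypothetical.

```python
import random

def simulate_vocabulary(n_tokens, n_core_max, alpha, p0, pc0, seed=0):
    """Draw n_tokens word-tokens and return (growth, tokens, n_core, n_noncore).

    growth[i] is the number of word-types after i + 1 tokens; tokens[i] is the
    integer id of the word-type attributed to token i.
    """
    rng = random.Random(seed)
    s = n_core_max            # damping constant, s = N_c^max for simplicity
    p_new = p0                # assumed initial probability of a new word-type
    tokens, growth = [], []
    n_core = n_noncore = 0
    for _ in range(n_tokens):
        if not tokens or rng.random() < p_new:
            type_id = n_core + n_noncore          # id of the new word-type
            if n_core < n_core_max and rng.random() < pc0:
                n_core += 1                       # core-word: p_new unchanged
            else:
                # noncore-word: reduce p_new (assumed form of the update rule)
                p_new *= 1.0 - alpha / (n_noncore + s)
                n_noncore += 1
        else:
            # reuse an old type with probability proportional to its past count
            type_id = rng.choice(tokens)
        tokens.append(type_id)
        growth.append(n_core + n_noncore)
    return growth, tokens, n_core, n_noncore
```

Plotting `growth` against the token index on double logarithmic axes gives the Heaps' plot of one realization, and the sorted type counts of `tokens` give its Zipf plot.
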

FIG. 4. Historical change in the composition of core-words in the English vocabulary. a) Fraction $f(t, \Delta t)$ of core-words in
$y(t)$ that remain among this set in $y(t + \Delta t)$ for $t \in [1805, 2000]$ (pale colors) and in particular for $t = 1905$ (black dots) with
the corresponding exponential fit (red line). b+c) Parameters $f_0$ and $\kappa$ of the exponential decay Eq. (7) of the curves in a),
obtained through least-square fits. Forward (backward) decay refers to $\Delta t > 0$ ($\Delta t < 0$).



IV. HISTORICAL CHANGES 

The model described so far has been shown to give a
good account of all databases and all years with the
same two fixed parameters, $N_c^{\max} = b^* = 7{,}873$ and
$\alpha = \gamma^* - 1 = 0.77$ in the case of English. A
natural question is, therefore, what actually changes on
historical time scales? Considering two different databases
(say two different years), our model does not consider any
differences in the actual composition of the database. Even
if the value of $N_c^{\max}$ remains constant, this does
not mean that the same word-types are observed in all
years. From the point of view of our model, the main change
a word-type can experience is to enter or to leave the
group of core-words.

Following this thought, we investigate in Fig. 4 the
composition of the set of core-words in the yearly
databases $y(t)$ for $t \in [1805, 2000]$. We calculate the
fraction $f(t, \Delta t)$ of core-words from $y(t)$ that
remain among this set in $y(t + \Delta t)$. Figure 4a)
shows that all curves can be qualitatively described by an
exponential decay

$$f(t, \Delta t) = f_0\, e^{-\kappa |\Delta t|}, \qquad (7)$$

independent of whether forward ($\Delta t > 0$) or backward
($\Delta t < 0$) time was considered. This is further
supported in Fig. 4b-c), where the parameters $f_0$ and
$\kappa$ obtained numerically from a least-square fit [35]
of Eq. (7) for all curves $f(t, \Delta t)$ with
$t \in [1805, 2000]$ are presented. In order to avoid
biases due to different numbers of points in the fit, for
each $t$ we performed a fit with the same number of points,
$\min\{2000 - t, t - 1805\}$, forwards and backwards in
time. On closer inspection, two features connected to the
interpretation of the parameters $f_0$ and $\kappa$ deserve
a more careful discussion. The parameter $f_0 < 1$
represents the discontinuous change of core-words between
two subsequent years. It strongly depends on the different
selection of books in the construction of the respective
databases and can be attributed to the finite size of
the database, which leads to a wrong estimation of the
"true" core-words. Consistently with this interpretation,
Fig. 4b) shows that the discontinuity $1 - f_0$ becomes
smaller over time, due to the fact that the database size
increases, leading to a better sampling of words.
Nevertheless, a value of $f_0 \approx 0.98$ indicates that
this effect is still far from negligible (e.g., for
$N_c^{\max} = 7{,}873$ this means that around 150 words of
the set of core-words will be different due to finite
sampling). In contrast, the decay rate $\kappa$ describes
the continuous replacement of core-words over time. The
most intriguing observation in Fig. 4c) is that this change
accelerates over time, as $\kappa$ grows by more than 50%
from 1805 to 2000.
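
The two parameters of an exponential decay $f_0 e^{-\kappa |\Delta t|}$ can be estimated with a few lines of code. The sketch below uses an ordinary least-squares fit of $\ln f$ against $|\Delta t|$, which is a simplification chosen here for illustration and not necessarily the least-square procedure of Ref. [35]; the function name and the input format are assumptions.

```python
import math

def fit_exponential_decay(delta_ts, fractions):
    """Fit f = f0 * exp(-kappa * |dt|) by linear regression of ln f on |dt|."""
    xs = [abs(dt) for dt in delta_ts]
    ys = [math.log(f) for f in fractions]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    kappa = -slope                                # decay rate per year
    f0 = math.exp(y_mean - slope * x_mean)        # extrapolation to dt = 0
    return f0, kappa
```

Using the same number of points forwards and backwards in time, as described above, corresponds to passing symmetric lists of $\Delta t$ values for each $t$.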



V. DISCUSSION 

In summary, we have shown that the rank-frequency
distribution and the vocabulary growth of languages are
best described by simple two-scaling functions. The
only two free parameters of these functions are related to
each other, remain almost unchanged over centuries
as well as across databases, and depend only on the
considered language. We have also shown that these
empirical findings can be interpreted as the result of a
finite number of words belonging to a core vocabulary,
which have different properties from the remaining
virtually unlimited number of words, as summarized in
Tab. II. This conclusion was reached based on a simple
generative stochastic model for vocabulary growth, which
should be considered as a null model for the prediction of
fluctuations and vocabulary sizes. Finally, we found that
the composition of the core vocabulary experiences an
exponential decay with a rate of 30 words per year, which,
remarkably, has been steadily accelerating in the past
decades.





                              Core-words                         Noncore-words
Number                        finite: $N_c^{\max} = b^*$ (Tab. I)   infinite: $|N_{\bar c}| \to \infty$
Frequency                     larger ($r \le b^*$)               smaller ($r > b^*$)
Effect on $p_{\mathrm{new}}$  none                               reduction

TABLE II. Properties of core ($c$) and noncore ($\bar c$) words in
our model.



It is worth comparing these findings with previous
results. As far as we are aware, our analysis provides
the first rigorous statistical confirmation of similar
previous proposals [38-40] of the double-scaling
generalization of Zipf's law, Eq. (1). The consequence of
this for vocabulary growth and Heaps' law (see also [40]),
which we drew based on a Poisson usage of words [21],
is that the rate of introduction of new words decays but
never vanishes with increasing database size. This is in
contrast to recent claims which reported a convergence
to a maximum vocabulary size [14]. We note that this
previous analysis was based on single books, and therefore
the database sizes were close to our transition point
$N_c^{\max}$, which we believe was misinterpreted as a
systematic decay. A generalization of a Yule-type process
to obtain a double-scaling degree distribution in a network
of words was introduced in Ref. [44]. Two crucial
differences from our model are that it yields fixed
exponents and cannot be understood as a generative model of
texts (token by token). Interestingly, in Ref. [6] an
analysis of the network constructed from the thesaurus also
showed the existence of a set of core-words of almost the
same size as ours.

Our simple model and expression for the vocabulary
growth as a function of database size have important
practical consequences. Simply knowing the database size
(in tokens, $M$, or potentially in bits), and using the
language-dependent parameters $(C, N_c^{\max}, \alpha)$
reported above, one can immediately estimate from Eq. (2)
the expected number of different words appearing more than
$n$ times. This is crucial for search engines and data
mining programs because it allows for an estimation of the
memory to be allocated prior to the scanning of an unknown
database, e.g., in the construction of the inverted index
[9-11]. Even the fluctuations around this expectation
can be easily computed through our generative model
or through the Poisson assumption of word usage. Of
course, this strong assumption ignores correlations and
typically underestimates the expected fluctuations, so
that our model should be considered as the simplest null
model. The existence of a transition between two scalings
(which is within the reach of even single large books)
shows that simple estimations based only on the traditional
Zipf's law have to be generalized. For instance, a
commonly used index of the vocabulary richness of a text is
Herdan's coefficient, given by the ratio
$\log N / \log M$ [8]. In view of our results, the
coefficient is highly dependent on which of the two scaling
regimes is reached with the given size of the text.

We now compare our observations of change on historical
time scales to other historical changes in language
usage. For the whole vocabulary, we obtain that its size
is mainly driven by the available database size. This is
in contrast to previous conclusions based on the same
google-ngram database, which detected a growth of
vocabulary in time [1]. Here it is important to note that
this previous analysis included a substantially different
filtering of the listed 1-grams to obtain valid words in
the vocabulary, including a frequency criterion and manual
classification. Still, our results show that also in this
case a more careful analysis of the role of the database
size is needed. For the core vocabulary, we observe a
fairly constant number of constituents over centuries. The
number of words common to the core vocabularies of
different databases was found to decay exponentially with
the time between publication of the databases, e.g., for
English the decay rate is approximately 30 words per year
and the half-life of a word in the core vocabulary is 200
years. A decay rate for the regularization of verbs was
reported in Ref. [5] with a half-life varying between 750
and more than 10,000 years, as well as for a fundamental
vocabulary of 200 words with a half-life varying between
300 and 38,800 years [4]. Perhaps our most intriguing
finding is the approximately linear increase of the rate in
time, which possibly confirms the overall acceleration of
language change and of society in general, as suggested in
Ref. [1].
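
The two numbers quoted for English can be cross-checked against each other. Assuming a purely exponential decay, a half-life $T_{1/2}$ corresponds to a rate $\kappa = \ln 2 / T_{1/2}$ per year, and multiplying by the size of the core vocabulary gives the initial number of core-words replaced per year. The function name below is ours.

```python
import math

def decay_rate_from_half_life(half_life_years):
    """Rate kappa (per year) of an exponential decay with the given half-life."""
    return math.log(2) / half_life_years

core_vocabulary_size = 7873      # b* for English, Tab. I
kappa = decay_rate_from_half_life(200.0)
words_per_year = kappa * core_vocabulary_size
```

This gives roughly 27 words per year, consistent with the quoted rate of approximately 30 words per year.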

Our results can be extended in many directions and 
open new possibilities of studies of vocabulary change. 
Directly related to our observations and model, the
specific value of the parameter $\gamma^* \approx 1.77$,
which is intriguingly similar across different languages,
remains to be explained. Another important point is to
assess the limitations of our estimations due to the role
of correlations inside real texts and databases, and how
these could be introduced into our model. Furthermore, it remains to be
shown whether the transition between two scalings due 
to the existence of a core vocabulary can be related to 
the phenomenon of phase transitions in ranking stability 
of complex systems recently reported in Ref. [45]. Fi- 
nally, we believe that our model provides the correct null 
model for normalizations due to database sizes and that 
therefore future investigations of historical effects on the 
vocabulary should take this into account. 

Supplementary Information for:
Stochastic model for the vocabulary growth of natural languages

Martin Gerlach and Eduardo G. Altmann
Max Planck Institute for the Physics of Complex Systems, 01187 Dresden, Germany



I. DATA 



The data obtained from the google-ngram database [1] is filtered in two steps. First, we decapitalize each word
(e.g. 'the' and 'The' are counted as the same word) and further restrict ourselves to words consisting uniquely of
letters present in the alphabet of the corresponding language and the symbol ' (apostrophe). This is meant
as a conservative approach in order to minimize the influence of foreign words, numbers (e.g. prices), or scanning
problems which are present in the raw data. In the second step, when constructing yearly data $y(t)$, i.e., words
present in books published in year $t$, we include only those words in the database $y(t)$ which appear more than 40
times in that particular year. In the same way, for the cumulative data $Y(t)$ we include only those words which
appeared more than 40 times until time $t$. In this way we avoid a possible bias due to the filtering applied in the
construction of the raw data (words had to appear more than 40 times over all years in order to be included in the
database [1]). As an example of the possible bias, had we not applied this filter, take two words (called '1' and
'2') with $N_1(t) = N_2(t) = 21$ occurrences in year $t$. If now $\forall t' \neq t: N_1(t') = 0$ and
$\exists t'' \neq t: N_2(t'') > 20$, word '2' would be present in the raw data whereas word '1' would not. As a result
we would only include word '2' in the yearly database $y(t)$. With our additional filter neither word '1' nor word '2'
appears in the yearly database $y(t)$.
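
The two filtering steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the input format (an iterable of (word, year, count) records), the function names, and the restriction to ASCII letters for English are hypothetical choices made for this sketch and are not taken from the original processing pipeline.

```python
from collections import Counter

# Assumption for this sketch: English alphabet = ASCII letters plus apostrophe.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz'")
MIN_COUNT = 40  # a word is kept only if it appears MORE than 40 times


def clean(word):
    """Decapitalize a word; return None if it contains other symbols."""
    w = word.lower()
    return w if w and set(w) <= ALLOWED else None


def _filtered_counts(records, keep_year):
    """Sum counts of cleaned words over the years selected by keep_year."""
    counts = Counter()
    for word, year, count in records:
        w = clean(word)
        if w is not None and keep_year(year):
            counts[w] += count
    # Second filtering step: frequency threshold applied to the summed counts.
    return {w: c for w, c in counts.items() if c > MIN_COUNT}


def yearly_vocabulary(records, t):
    """Yearly data y(t): words with more than 40 occurrences in year t."""
    return _filtered_counts(records, lambda year: year == t)


def cumulative_vocabulary(records, t):
    """Cumulative data Y(t): words with more than 40 occurrences until t."""
    return _filtered_counts(records, lambda year: year <= t)
```

With the two example words of the text (21 occurrences each in year t, and 21 further occurrences of word '2' in another year), neither word enters `yearly_vocabulary`, while word '2' enters `cumulative_vocabulary` once both years are included.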



In Fig. S1 we show the resulting database size for the yearly data $y(t)$ and the cumulative data $Y(t) = \sum_{t' \le t} y(t')$
in terms of tokens and types for English, French, Spanish, German, and Russian. In this context type refers to the
number of distinct words, whereas token refers to the total number of words.

For the yearly database $y(t)$ we use data in the period $t \in [1805, 2000]$, because as already indicated in [1], the
database composition may have changed in a noncontinuous way at $t \approx 1800$. This claim is supported in Fig. S2,
where we calculate Kendall's rank correlation coefficient [2] $\tau[y(t), y(t')]$ between the common types of the databases
$y(t)$ and $y(t')$ for $1500 < t < t' < 2000$ as

$$\tau[y(t), y(t')] = \frac{n_c - n_d}{\frac{1}{2} n (n - 1)}, \tag{S1}$$

where $n$ is the total number of common elements, $n_c$ the number of concordant, and $n_d$ the number of discordant
pairs between the two databases with respect to the ranking of frequencies. Clearly, at $t = 1800$ a noncontinuous
change in $\tau$ can be identified, from which we conclude that the database composition is dramatically different in the years
before and after $t = 1800$. In order not to be affected by this change, the yearly data $y(t)$ is only considered in the
period $t \in [1805, 2000]$. However, in order to take advantage of the full size of the database, the cumulative data $Y(t)$
is constructed taking into account all the years prior to $t = 1805$.
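
A minimal sketch of the rank correlation of Eq. (S1) is given below. The representation of a database as a dictionary mapping each word type to its frequency, the function name, and the treatment of ties (counted as neither concordant nor discordant) are assumptions of this sketch. The direct pair enumeration scales quadratically with the number of common types, so it is meant for illustration rather than for databases of realistic size.

```python
from itertools import combinations


def kendall_tau(freq_a, freq_b):
    """Kendall's rank correlation, Eq. (S1), over the common word types.

    freq_a, freq_b: dicts mapping word type -> frequency in each database.
    Tied pairs count as neither concordant nor discordant (assumption).
    """
    common = sorted(set(freq_a) & set(freq_b))
    n = len(common)
    if n < 2:
        raise ValueError("need at least two common word types")
    n_c = n_d = 0
    for u, v in combinations(common, 2):
        # Positive product: both databases order the pair the same way.
        s = (freq_a[u] - freq_a[v]) * (freq_b[u] - freq_b[v])
        if s > 0:
            n_c += 1
        elif s < 0:
            n_d += 1
    return (n_c - n_d) / (0.5 * n * (n - 1))
```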



II. MAXIMUM LIKELIHOOD ESTIMATION 



A. Theory 

In this section we describe the distributions proposed for fitting the rank-frequency distribution and present
the details of the maximum likelihood estimation procedure. The procedures are standard [3], but here we fit directly
the rank-frequency distribution originally proposed by Zipf [4] instead of the word frequency distribution considered
in Ref. [5].



In Tab. S1 the proposed descriptive models used to fit the rank-frequency distribution are presented. The notation
$F(r; \Omega)$ means that the distribution $F$ depends on the rank $r$, and $\Omega$ is the set of parameters. The normalization
constant $C = C(\Omega)$ is a function of the respective parameters and is fixed by $\sum_{r=1}^{\infty} F(r; \Omega) = 1$. In practice, this is
calculated with the Euler-Maclaurin formula available in the package mpmath [6].

The parameters of each distribution are estimated numerically by minimizing the negative of the log-likelihood

$$\Omega^* = \arg\min_{\Omega} \mathcal{L}'(\Omega), \tag{S2}$$

  i   distribution                                    $F(r; \Omega)$                                                      set of parameters $\Omega$
  1   Power-Law                                       $C r^{-\gamma}$                                                     $\gamma$
  2   Shifted Power-Law                               $C (r + b)^{-\gamma}$                                               $\gamma, b$
  3   Power-Law with Exponential cutoff (beginning)   $C \exp(-b/r) \, r^{-\gamma}$                                       $\gamma, b$
  4   Power-Law with Exponential cutoff (tail)        $C \exp(-b r) \, r^{-\gamma}$                                       $\gamma, b$
  5   Log-normal                                      $C r^{-1} \exp(-\frac{1}{2} (\ln r - \mu)^2 / \sigma^2)$            $\mu, \sigma$
  6   Weibull                                         $C r^{\gamma - 1} \exp(-b r^{\gamma})$                              $\gamma, b$
  7   Double Power-Law                                $C r^{-1}$ for $r \le b$; $C b^{\gamma - 1} r^{-\gamma}$ for $r > b$   $\gamma, b$

TABLE S1. Proposed models to fit rank-frequency distributions.



where

$$\mathcal{L}'(\Omega) = -\ln \mathcal{L} = -\sum_{i=1}^{M} \ln F(r(i); \Omega). \tag{S3}$$

In this expression $M$ is the number of tokens, which implies that the sum goes over each observed token $i$ and its
corresponding rank $r(i)$. In practice, the minimization is performed with a Nelder-Mead simplex algorithm (available
in the SciPy library [7]).
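
To make the estimation procedure concrete, the following sketch fits the simplest model of Tab. S1 (the power-law, $i = 1$) by minimizing the negative log-likelihood of Eq. (S3). It deliberately departs from the procedure described above in two ways, both of which are assumptions of this sketch: the normalization is computed over a finite range of ranks instead of with the Euler-Maclaurin formula, and, because there is only one parameter, a golden-section search replaces the Nelder-Mead simplex algorithm. The sum over tokens is grouped by rank, so that `counts[r-1]` holds the number of tokens of the type with rank `r`.

```python
import math


def neg_log_likelihood(gamma, counts, r_max):
    """Eq. (S3) for F(r) = C r^(-gamma), normalized over r = 1..r_max."""
    log_c = -math.log(sum(r ** -gamma for r in range(1, r_max + 1)))
    # Grouping tokens by rank: each rank r contributes n_r * ln F(r).
    return -sum(n * (log_c - gamma * math.log(r))
                for r, n in enumerate(counts, start=1))


def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:          # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


def fit_power_law(counts, r_max=None, bounds=(0.1, 5.0)):
    """Return the exponent gamma minimizing the negative log-likelihood."""
    r_max = r_max or len(counts)
    return golden_section_min(
        lambda g: neg_log_likelihood(g, counts, r_max), *bounds)
```

For the two-parameter models of Tab. S1 a multidimensional minimizer such as the Nelder-Mead method mentioned above would be needed in place of the one-dimensional search.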

The quality of the fit was evaluated quantitatively by means of a p-value obtained from a $\chi^2$-statistic [8]:

$$\chi^2 = \sum_{j=1}^{Q} \frac{(N_j - n_j)^2}{n_j}. \tag{S4}$$

Here the domain is partitioned into $Q$ cells, such that the expected number of observations per cell $n_j > 5$ [9], with
$N_j$ being the actual observed number of observations in cell $j$. A recently proposed alternative strategy [10] involving
the comparison of the Kolmogorov-Smirnov statistics of the actual empirical data with randomly generated data
is computationally not feasible in this case, because it would require us to draw ^ 10^^ random numbers (p-value 
precision 0.01) due to the size of the database of > 10^^ tokens. 

In the last step we determine which of the proposed models $i = 1 \ldots R$, where $R$ is the number of different models
considered, is most likely to describe the data. In order to account for the different number of fitted parameters we
calculate the Akaike information criterion (AIC) [11] for each model $i$

$$\mathrm{AIC}_i = 2 \mathcal{L}'(\Omega^*) + 2K, \tag{S5}$$

where $K$ is the number of parameters estimated in the model. The model which gives the minimum value
$\mathrm{AIC}_{\min} = \min_i \{\mathrm{AIC}_i\}$ is most likely to describe the given data. From this we can calculate the relative likelihood $l_i$ [12]

$$l_i = \exp\left(-(\mathrm{AIC}_i - \mathrm{AIC}_{\min})/2\right), \tag{S6}$$

which states how likely model $i$ is to describe the data in comparison with the best model. This implies that the
probability $w_i$ that model $i$ (out of the $R$ models considered) describes the data is given by [12]

$$w_i = P(\text{model } i \mid \text{data}) = l_i \Big/ \sum_{j=1}^{R} l_j. \tag{S7}$$
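
The model comparison of Eqs. (S5)-(S7) amounts to a few lines of arithmetic, sketched here in Python. The function name and the input format (one minimized negative log-likelihood and one parameter count per model) are assumptions of this sketch.

```python
import math


def akaike_weights(neg_log_likelihoods, n_params):
    """Return (AIC values, model probabilities w_i) following Eqs. (S5)-(S7).

    neg_log_likelihoods: minimized negative log-likelihood of each model.
    n_params: number K of fitted parameters of each model.
    """
    aic = [2.0 * nll + 2.0 * k
           for nll, k in zip(neg_log_likelihoods, n_params)]       # Eq. (S5)
    aic_min = min(aic)
    rel = [math.exp(-(a - aic_min) / 2.0) for a in aic]            # Eq. (S6)
    total = sum(rel)
    return aic, [x / total for x in rel]                           # Eq. (S7)
```

Subtracting the minimum AIC before exponentiating keeps the relative likelihoods between 0 and 1, which avoids numerical overflow; very large AIC differences, as expected for databases with many tokens, simply give weights that underflow to zero for all but the best model.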



B. Results 



In this section we give a detailed overview of the results obtained from fitting the models in Tab. S1 to the
rank-frequency distributions for all languages considered, i.e., English, French, Spanish, German, and Russian. In
Figs. S3-S7(a,b) we plot the AIC from the models in Tab. S1 applied to yearly $y(t)$ and cumulative data $Y(t)$ of the
respective language. In Figs. S3-S7(c) we show explicitly the rank-frequency distribution of the data $Y(2000)$ and the
corresponding fits of the three models that yield the best description: the double power-law ($i = 7$), the power-law
with an exponential cutoff in the tail ($i = 3$), and the log-normal ($i = 5$).

For English, $i = 7$ yields the best description of the yearly data for $t > 1950$ and for the cumulative data for
$t > 1810$. As the databases $y(1950)$ and $Y(1810)$ can be considered independent datasets, and by comparing with
Fig. S1(a), we conclude that the size of the database needs to exceed a certain threshold (~ 10^ tokens) in order
to discriminate competing models like $i = 3$ in the tail. This is further corroborated by the inset
of Fig. S3(c), where it can be seen that $i = 7$ outperforms $i = 3, 5$ especially in the description of the tail of the
distribution.

For the languages other than English the AIC of the yearly data $y(t)$ favors $i = 3$. This comes as no surprise
since their size is limited to < 10^ tokens for all $t \in [1805, 2000]$, as can be seen in Fig. S1(a). In contrast, the
cumulative data $Y(t)$ shows different results. For French and Spanish the AIC favors $i = 7$ as the size of the database
grows, especially for the largest dataset $Y(2000)$. Again, this becomes clear when looking at the deviations of the
fits from the real data in the insets of Figs. S4(c) and S5(c), which seem to diverge for $i = 3, 5$ in the tail of the distribution.



For German and Russian the AIC identifies $i = 7$ only as the second best fit for the cumulative data $Y(t)$. This
is most probably due to the fact that the size of the database for those languages is still not large enough
to discriminate a second power-law regime clearly. Additionally, for these languages the critical rank $b^*$, where a
transition between the two power-laws occurs, is shifted towards higher values, possibly due to the different degree
of inflection (see main text). This in turn implies that the fraction of tokens belonging to the power-law in the tail
is much smaller than in English, which means that a larger database is needed in order to discriminate $i = 3, 5$.
This claim is further supported by the insets of Figs. S6(c) and S7(c), where we show that especially in the tail of the
distribution $i = 7$ deviates less from the data than the competing models.

Whereas English, French, and Spanish give approximately the same values for the largest database $Y(2000)$, German
and Russian show larger values for $b$ and a different power-law exponent in the tail (see main text). The latter might
point towards more subtle differences between the languages besides inflection.

III. ZIPFIAN ENSEMBLE 
A. Theory 

The Zipfian Ensemble (ZE) [13] is a simple approach to model the size of the vocabulary depending on the text
length given the rank-frequency distribution $F(r)$, $r = 1 \ldots N^{\max}$, where $N^{\max} \in [1, \infty)$ is the hypothetical (maximum)
size of the vocabulary. The occurrence of each word-type with rank $r$ is assumed to be governed independently by a
Poisson process with an intensity equal to its frequency, i.e., $\lambda(r) = F(r)$, where time is measured in units of tokens
$M$ (text length). This means that the probability that this word-type occurs at least once in the interval $T_1 \in [0, M]$
is given by [14]

$$P(T_1 \le M; r) = 1 - e^{-F(r) M}. \tag{S8}$$

From this we can calculate the vocabulary size $N(M)$ by summing over all word-types, which gives the expected
(average over realizations) number of words (out of $N^{\max}$ different words in total) that appeared at least once up
until time $M$:

$$N(M) = \sum_{r=1}^{N^{\max}} \left[1 - e^{-F(r) M}\right]. \tag{S9}$$

The variance of the ZE over the different realizations indicates the expected fluctuation around $N(M)$ in Eq. (S9)
and is given by [13]:

$$V[N(M)] = N(2M) - N(M). \tag{S10}$$

Although similar, this framework differs from the usual 'bag-of-words' [15] in the sense that i) the expected time of
occurrence of a word need not be an integer and ii) two words can in principle occur at the same time due to the
independence of the Poisson processes. This in turn limits the interpretation of the ZE as a model for the creation of
a text token by token. However, it allows for an analytic treatment and the continuous time approximation becomes
better in the limit of large databases.
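
Equations (S9) and (S10) can be evaluated directly for any normalized rank-frequency distribution. The sketch below does so for the double power-law of Tab. S1. The truncation of the distribution at a finite maximum rank `n_max`, the function names, and the default parameter values (taken from the English fit quoted in this document) are assumptions of this sketch.

```python
import math


def double_power_law(n_max, gamma=1.77, b=7873):
    """Normalized F(r), r = 1..n_max, for the double power-law (model 7)."""
    raw = [1.0 / r if r <= b else b ** (gamma - 1.0) * r ** -gamma
           for r in range(1, n_max + 1)]
    total = sum(raw)  # finite-range normalization (assumption of the sketch)
    return [x / total for x in raw]


def ze_vocabulary(freqs, m):
    """Expected vocabulary size N(M) of the Zipfian Ensemble, Eq. (S9)."""
    # -expm1(-x) = 1 - exp(-x), accurate also for very small F(r) * M.
    return sum(-math.expm1(-f * m) for f in freqs)


def ze_variance(freqs, m):
    """Variance of the vocabulary size, Eq. (S10): N(2M) - N(M)."""
    return ze_vocabulary(freqs, 2 * m) - ze_vocabulary(freqs, m)
```

Equation (S10) follows because each word type contributes an independent Bernoulli variable with success probability $p = 1 - e^{-F(r) M}$, whose variance $p(1 - p) = e^{-F(r) M} - e^{-2 F(r) M}$ is exactly the difference of the corresponding terms of $N(2M)$ and $N(M)$.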

B. ZE in the double power-law

In this section we want to show that a double power-law in the rank-frequency distribution (Eq. (1) main text) can
lead to the double scaling in the vocabulary growth (Eq. (2) in main text) in the framework of the ZE.
First, we generalize the ZE to cases where words have to appear at least $n$ times before they are considered part of
the vocabulary. The introduction of a threshold $n$ means that instead of looking at the probability for the time until
its first occurrence $T_1$, one considers $T_n$, the time it takes until the word occurs $n$ times, and Eq. (S8) becomes

$$P(T_n \le M; r) = 1 - \sum_{j=0}^{n-1} \frac{(F(r) M)^j}{j!} e^{-F(r) M}. \tag{S11}$$



From this, Eq. (S9) can be directly extended to

$$N^{(n)}(M) = \sum_{r=1}^{N^{\max}} \left[1 - \sum_{j=0}^{n-1} \frac{(F(r) M)^j}{j!} e^{-F(r) M}\right]. \tag{S12}$$



In the next step we consider the limit $n \gg 1$. As the stochastic variable $T_n$ is the sum of $n$ independent copies
of the stochastic variable $T_1$, which is distributed according to Eq. (S8), it follows from the central limit theorem
that $P(T_n = M/n; r)$ will approach a Gaussian with vanishing variance, such that by rescaling $M \to M/n$
Eq. (S11) asymptotically becomes

$$\lim_{n \to \infty} P(T_n \le M/n; r) = \Theta(M/n - \tau(r)), \tag{S13}$$



where $\tau(r) = 1/F(r)$ is the inverse of the frequency $F(r)$ of the particular word-type and $\Theta(x)$ is the Heaviside step
function. For the vocabulary growth this yields

$$\lim_{n \to \infty} N^{(n)}(M/n) = \sum_{r=1}^{N^{\max}} \Theta(M/n - \tau(r)). \tag{S14}$$

Thus we obtain a direct relationship between the rank-frequency distribution and the vocabulary growth

$$\lim_{n \to \infty} N^{(n)}(\tilde{M} = 1/F(r)) = r, \tag{S15}$$

where $\tilde{M} = M/n$.

In Fig. S8 we show the $N^{(n)}(\tilde{M})$ curve obtained from the ZE for the double power-law (Eq. (1) main text) with
parameters $\gamma^* = 1.77$ and $b^* = 7873$ for different thresholds $n$. One can see that the growth curves for $n > 8$ are
almost indistinguishable from the asymptotic solution Eq. (S15), which can be attributed to the fast convergence
implied by the central limit theorem.

From these observations we conclude that Eq. (S15) is already a good approximation for $n \gg 1$, where in practice
this can mean $n > 10$. As a result we obtain Eq. (2) from the main text. This means that the increase of the threshold
$n$ leads to a reduction of the fluctuations of the growth curve of the vocabulary and can be explained as a result of
a simple stochastic process. In Fig. S9 we show that this claim holds when applied to real texts of the size of single
books, as well as for a collection of several million books, as in Fig. S10.
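
The convergence of the thresholded growth curve, Eq. (S12), towards the asymptotic relation, Eq. (S15), can be checked numerically with a sketch such as the following. The function names and the evaluation of the Poisson sum in logarithmic form (used here only to avoid numerical underflow for large $F(r) M$) are assumptions of this sketch.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), with terms evaluated in log form."""
    if lam <= 0.0:
        return 1.0
    log_lam = math.log(lam)
    total = sum(math.exp(j * log_lam - lam - math.lgamma(j + 1))
                for j in range(k + 1))
    return min(total, 1.0)


def ze_vocabulary_threshold(freqs, m, n):
    """Eq. (S12): expected number of types seen at least n times in m tokens."""
    return sum(1.0 - poisson_cdf(n - 1, f * m) for f in freqs)


def asymptotic_vocabulary(freqs, m_rescaled):
    """Eq. (S15): number of types with 1/F(r) <= rescaled text length M/n."""
    return sum(1 for f in freqs if 1.0 / f <= m_rescaled)
```

For $n = 1$ the first function reduces to Eq. (S9). For large $n$, evaluating it at $M = n \tilde{M}$ approaches the step-like count returned by `asymptotic_vocabulary(freqs, m_rescaled)` for the same $\tilde{M}$.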



IV. NUMERICAL SIMULATION OF STOCHASTIC MODEL 

In this section we show the results of the direct numerical simulation of the model proposed in Sec. III, main text.



A. Parameters and initialization 



In order to simulate the model, apart from fixing a number of parameters ($N_c^{\max}$, $p_c^0$, $\alpha$), we need to prescribe how
the model is initialized, e.g., what is the initial probability of using a new word and how many word types exist
at the first iteration of the model. Concerning the parameters, the initial probability of choosing a core-word is set
to $p_c^0 = 0.99$, such that $1 - p_c^0 \ll 1$ (see main text), and the two other parameters are fixed by the fitting parameters
($N_c^{\max} = b^* = 7873$, $\alpha = \gamma^* - 1 = 0.77$ in English, see main text). Concerning the initialization of the model, an
important point that needs to be taken into account is that we are interested in retrieving the Heaps' plot obtained
after re-scaling the number of tokens $M$ by the threshold $n$ as $\tilde{M} = M/n$, see Fig. S10 (for simplicity and
computational efficiency in our simulations we choose $n = 1$). This implies that the first word type of our model
should on average appear not at the first token but instead approximately at $\tilde{M} \approx 1/F(1)$ (where $F(1)$ is the frequency
of the most frequent word). In view of this requirement, we set $p_{\mathrm{new}}^0 = C = F_{dp}(1) = 0.0922$ (for English, see main
text) and we start with an empty list of word types (the tokens used before the appearance of the first word type are
counted but not attributed to any word type). The simulations were done with a maximum number of M = 10^ steps 
in units of tokens, a restriction imposed by the computational effort required. The reported results were obtained as 
the average of 100 realizations of the model. 



B. Heaps' plot 



In order to be able to compare the results with the google-ngram data, where a natural threshold of $n = 41$ is
imposed (see SI-Sec. I), we incorporate the threshold $n$ by using the rescaled coordinate $\tilde{M} = M/n$, as motivated and
discussed in SI-Sec. III B. In Fig. S11 we show the expected vocabulary growth $N(\tilde{M})$. We can clearly see that the
two scaling regimes of Eq. (2), main text, are recovered from our model. Deviations from the data are within 50% over
as much as 7 orders of magnitude. The poorer agreement for large $M$ can be attributed to a slight overestimation
by our model of the point of transition between the two scaling regimes. This could be addressed by modifying our
model (e.g. modifying our simple choice of $p_c$ in Eq. (3) main text) so that the decay in $p_{\mathrm{new}}$ and the transition
to the second scaling occur already for shorter $M$. For even larger $M$ we do not have data for our model due to
computational limitations. However, based on our asymptotic calculations in Sec. III, main text, we expect that the
observed agreement will extend over the entire range of available data.



C. Zipf's plot 

In the analysis of the Zipf's plot $F(r)$ obtained by our model it is important to take into account that Yule-type
processes (already used words are drawn proportionally to their previous occurrences) give a disproportionally
large weight to the first word types used in the simulation. This happens because in the beginning of the simulation
there are only a few existing word types to which all drawn tokens are attributed. Figure S12(a) illustrates this
effect and shows that it is inversely proportional to $p_{\mathrm{new}}^0$, which sets the time-scale for the appearance of new word
types in the beginning of the simulation. This artifact can be easily addressed by excluding a few word types of
smallest rank and re-normalizing the remaining distribution, as shown in Fig. S12(b-d). Alternatively, one can modify
the preferential attachment part of the model (right-most branch in Fig. 3, main text) so that the very first used
word-types follow a different rule and have a vanishing probability of usage for large $M$ (notice that this would not
modify the Heaps' plot). For the case of English, Fig. S13 shows that the removal of only one word type ($r = 1$) is
sufficient in order to obtain a good agreement with data (less than 50% of deviation over 7 orders of magnitude). As
discussed in the case of Heaps' plot above, the transition to the second scaling appears slightly shifted in comparison
to the data.
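
The exclusion of the lowest-rank word types followed by re-normalization can be written compactly. In the sketch below, the function name and the representation of the rank-frequency distribution as a list ordered by rank are assumptions made for illustration.

```python
def drop_top_ranks(freqs, k):
    """Remove the k most frequent types, shift ranks by k, and renormalize.

    freqs: frequencies ordered by rank (freqs[0] belongs to rank r = 1).
    Returns a new list in which the former rank k + 1 becomes rank 1
    and the remaining frequencies sum to one.
    """
    if not 0 <= k < len(freqs):
        raise ValueError("k must satisfy 0 <= k < number of types")
    rest = freqs[k:]
    total = sum(rest)
    return [f / total for f in rest]
```

For the English case discussed above, the call would be `drop_top_ranks(freqs, 1)`, which discards only the type of rank $r = 1$.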

FIG. S1. Size of the database after filtering. a) Number of tokens for yearly data $y(t)$ (x-symbols) and cumulative data $Y(t)$
(line). b) Number of types for yearly data $y(t)$ (x-symbols) and cumulative data $Y(t)$ (line). Each language is marked by a
different color.

FIG. S2. Correlation between data in different years for English. Kendall's rank correlation Eq. (S1) between yearly data $y(t)$,
$y(t')$ for $t, t' \in [1500, 2000]$ with $t < t'$. The dashed lines show $t = 1800$ and $t' = 1800$, where a noncontinuous change in the
correlation occurs.

FIG. S3. Discrimination between different models with AIC for English. Value of the AIC for a) yearly data $y(t)$ and b) cumulative
data $Y(t)$. The inset shows the difference $\Delta\mathrm{AIC} = \mathrm{AIC}_i/M - \mathrm{AIC}_7/M$, $i = 1 \ldots 6$, meaning that if $\Delta\mathrm{AIC} > 0$ the double
power-law is the most likely model among those proposed to describe the data. Numbers refer to the indices of the models in Tab. S1.
c) Rank-frequency plot for $Y(2000)$ and the fits of the three best models. The inset shows the ratio $F_{\mathrm{data}}(r) / F_{\mathrm{fit}}(r)$.

FIG. S4. Same as in Fig. S3 for French.

FIG. S5. Same as in Fig. S3 for Spanish.

FIG. S8. Influence of threshold $n$ on size of vocabulary for the ZE. Growth curves $N(\tilde{M} = M/n)$ obtained from the ZE for the double
power-law (Eq. (1) main text) with parameters $\gamma^* = 1.77$, $b^* = 7873$ with different thresholds $n$. The dashed curve shows the
asymptotic solution Eq. (S15).

FIG. S9. Influence of threshold $n$ on size of vocabulary for single books. Growth curves $N(\tilde{M} = M/n)$ obtained from 4 different
books with different thresholds $n$. a) Charles Darwin: "The Voyage of the Beagle" b) Mark Twain: "Life on the Mississippi"
c) Miguel de Cervantes Saavedra: "Don Quixote", translated by John Ormsby d) Leo Tolstoy: "War and Peace", translated
by Louise and Aylmer Maude. All texts were retrieved from the Project Gutenberg (www.gutenberg.org) on 21.09.2010.
The dashed curve shows the asymptotic solution Eq. (S15) of the ZE assuming a double power-law (Eq. (1) main text) with
parameters $\gamma^* = 1.77$, $b^* = 7873$.

FIG. S10. Influence of threshold $n$ on size of vocabulary for the English google-ngram database. Growth curves $N(\tilde{M} = M/n)$
obtained from yearly data $y(t)$ (x-symbol) and cumulative data $Y(t)$ (line) for different values of the threshold n with n ∈ [41, 10^]
marked by different colors. The dashed curve shows the asymptotic solution Eq. (S15) of the ZE assuming a double power-law
(Eq. (1) main text) with parameters $\gamma^* = 1.77$, $b^* = 7873$.

FIG. S11. Vocabulary growth, $N(\tilde{M})$, from the numerical simulation of our stochastic model (Heaps' plot). Number of word-types
as a function of word-tokens of the English database for yearly (x-symbols) database, cumulative (solid) database, and
the expectation from our stochastic model (dashed). Single realizations of the stochastic process are shown in thin/gray (solid).
Each realization is calculated for an imaginary text of M = 10^ tokens.

FIG. S12. Influence of the first word types on the rank-frequency distribution of our model. Rank-frequency distribution $F(r)$
from our numerical simulation with different values for $p_{\mathrm{new}}^0 \in \{0.1, 0.01, 0.001, 0.0001\}$ after filtering the $k$ most frequent types,
where a) $k = 0$, b) $k = 1$, c) $k = 3$, and d) $k = 10$. In this context, filtering means that i) we neglect all tokens associated with
ranks $r = 1 \ldots k$; ii) the rank of all remaining types is lowered by $k$, e.g., the rank of the $(k+1)$-th most frequent type becomes
$r = 1$; and iii) the distribution is renormalized such that $\sum_{r=1}^{N-k} F(r) = 1$, where $N$ is the number of types before the filtering.

FIG. S13. Rank-frequency distribution, $F(r)$, from the numerical simulation of our stochastic model (Zipf's plot). Rank-frequency
distribution for the English database $Y(2000)$ (solid) and the expectation from our stochastic model (dashed), where
a) shows the unfiltered result, and b) shows the distribution after filtering the type which has rank $r = 1$ in a). Single
realizations of the stochastic process are shown in thin/gray (solid). Each realization is calculated for an imaginary text of
M = 10^ tokens.



